Pandas is a Python package that aims to make working with data as easy and intuitive as possible. It fills the role of a foundational real world data manipulation library and interfaces with many other Python packages.
By the end of this file you should have seen simple examples of:
Further Reading:
http://pandas.pydata.org/pandas-docs/stable/10min.html
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join
Image Credit: David Jenkins at Bifengxia Panda Reserve in Chengdu
In [1]:
# Python imports
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
There are two main data structures in pandas:
Here, the 1- and 2-dimensional data sets are the focus of this lesson.
Pandas DataFrames are analogus to R's data.frame, but aim to provide additional functionality.
Both dataframes and series data structures have indicies, which are shown on the left:
In [2]:
series1 = pd.Series([1,2,3,4])
print(series1)
Dataframes use the IPython display method to look pretty, but will show just fine when printed also. (There's a way to make all of the dataframes print pretty via the IPython.display.display method, but this isn't necessary to view the values):
In [3]:
df1 = pd.DataFrame([[1,2,3,4],[10,20,30,40]])
print(df1)
df1
Out[3]:
Indices can be named:
In [4]:
# Rename the columns
df1.columns = ['A','B','C','D']
df1.index = ['zero','one']
df1
Out[4]:
In [5]:
# Create the dataframe with the columns
df1 = pd.DataFrame([[1,2,3,4],[10,20,30,40]], columns=['A','B','C',"D"], index=['zero','one'])
df1
Out[5]:
In [6]:
df1 = pd.DataFrame(np.random.randn(5,4), columns = ['A','B','C','D'], index=['zero','one','two','three','four'])
print(df1)
In [7]:
df1.to_csv('datafiles/pandas_df1.csv')
!ls datafiles
In [8]:
df2 = pd.read_csv('datafiles/pandas_df1.csv', index_col=0)
print(df2)
In [9]:
df1.to_hdf('datafiles/pandas_df1.h5', 'df')
!ls datafiles
In [10]:
df2 = pd.read_hdf('datafiles/pandas_df1.h5', 'df')
print(df2)
In [11]:
df2.dtypes
Out[11]:
We can create dataframes of multiple datatypes:
In [12]:
col1 = range(6)
col2 = np.random.rand(6)
col3 = ['zero','one','two','three','four','five']
col4 = ['blue', 'cow','blue', 'cow','blue', 'cow']
df_types = pd.DataFrame( {'integers': col1, 'floats': col2, 'words': col3, 'cow color': col4} )
print(df_types)
In [13]:
df_types.dtypes
Out[13]:
We can also set the 'cow color' column to a category:
In [14]:
df_types['cow color'] = df_types['cow color'].astype("category")
df_types.dtypes
Out[14]:
In [15]:
time_stamps = pd.date_range(start='2000-01-01', end='2000-01-20', freq='D') # Define index of time stamps
df1 = pd.DataFrame(np.random.randn(20,4), columns = ['A','B','C','D'], index=time_stamps)
print(df1)
In [16]:
df1.head(3) # Show the first n rows, default is 5
Out[16]:
In [17]:
df1.tail() # Show the last n rows
Out[17]:
We can also separate the metadata (labels, etc) from the data, yielding a numpy-like output.
In [18]:
df1.columns
Out[18]:
In [19]:
df1.values
Out[19]:
In [20]:
df1['A'].head() # df1.A.head() is equivalent
Out[20]:
Note that tab completion is enabled for column names:
In [21]:
df1.A
Out[21]:
We can specify row ranges:
In [22]:
df1[:2]
Out[22]:
In [23]:
df1.loc[:'2000-01-5',"A"] # Note that this includes the upper index
Out[23]:
In [24]:
df1.iloc[:3,0] # Note that this does not include the upper index like numpy
Out[24]:
In [25]:
index_timestamp = pd.Timestamp('2000-01-03') # Create a timestamp object to index
df1.at[index_timestamp,"A"] # Index using timestamp (vs string)
Out[25]:
In [26]:
df1.iat[3,0]
Out[26]:
In [27]:
df1.head()>0.5
Out[27]:
That matrix can then be used to index the DataFrame:
In [28]:
df1[df1>0.5].head() # Note that the values that were 'False' are 'NaN'
Out[28]:
In [29]:
df_types
Out[29]:
In [30]:
bool_series = df_types['cow color'].isin(['blue'])
print(bool_series) # Show the logical indexing
df_types[bool_series] # Index where the values are true
Out[30]:
In [31]:
df_types.sort_values(by="floats")
Out[31]:
In [32]:
df_nan = pd.DataFrame(np.random.rand(6,2), columns = ['A','B'])
df_nan
Out[32]:
In [33]:
df_nan['B'] = df_nan[df_nan['B']>0.5] # Prints NaN Where ['B'] <= 0.5
print(df_nan)
Print a logical DataFrame where NaN is located:
In [34]:
df_nan.isnull()
Out[34]:
Drop all rows with NaN:
In [35]:
df_nan.dropna(how = 'any')
Out[35]:
Replace NaN entries:
In [36]:
df_nan.fillna(value = -1)
Out[36]:
In [37]:
df1 = pd.DataFrame(np.zeros([3,3], dtype=np.int))
df1
Out[37]:
In [38]:
df2 = pd.concat([df1, df1], axis=0)
df2 = df2.reset_index(drop=True) # Renumber indexing
df2
Out[38]:
In [39]:
newdf = pd.DataFrame({0: [1], 1:[1], 2:[1]})
print(newdf)
df3 = df2.append(newdf, ignore_index=True)
df3
Out[39]:
In [40]:
left = pd.DataFrame({'numbers': ['K0', 'K1', 'K2', 'K3'],
'English': ['one', 'two', 'three', 'four'],
'Spanish': ['uno', 'dos', 'tres', 'quatro'],
'German': ['erste', 'zweite','dritte','vierte']})
left
Out[40]:
In [41]:
right = pd.DataFrame({'numbers': ['K0', 'K1', 'K2', 'K3'],
'French': ['un', 'deux', 'trois', 'quatre'],
'Afrikaans': ['een', 'twee', 'drie', 'vier']})
right
Out[41]:
In [42]:
result = pd.merge(left, right, on='numbers')
result
Out[42]:
In [43]:
dfg = pd.DataFrame({'A': ['clogs','sandals','jellies']*2,
'B': ['socks','footies']*3,
'C': [1,1,1,3,2,2],
'D': np.random.rand(6)})
dfg
Out[43]:
In [44]:
dfg.pivot_table(index=['A','B'], columns=['C'], values='D')
Out[44]:
In [45]:
dfg.stack()
Out[45]:
In [46]:
dfg.groupby(['B']).count()
Out[46]:
In [47]:
dfg.groupby(['A']).mean()
Out[47]:
In [48]:
dfg['D'].mean()
Out[48]:
In [49]:
dfg['D']
Out[49]:
In [50]:
dfg_Ds = dfg['D'].shift(2)
dfg_Ds
Out[50]:
In [51]:
dfg['D'].div(dfg_Ds )
Out[51]:
In [52]:
dfg
Out[52]:
In [53]:
dfg['C'].value_counts()
Out[53]:
In [54]:
df_types.describe()
Out[54]:
In [55]:
df_types.T
Out[55]:
In [56]:
def f(x): # Define function
return x + 1
dfg['C'].apply(f)
Out[56]:
Lambda functions may also be used
In [57]:
dfg['C'].apply(lambda x: x + 1)
Out[57]:
In [58]:
dfg['A'].str.title() # Make the first letter uppercase
Out[58]:
In [59]:
n = 100
X = np.linspace(0, 5, n)
Y1,Y2 = np.log((X)**2+2), np.sin(X)+2
dfp = pd.DataFrame({'X' : X, 'Y1': Y1, 'Y2': Y2})
dfp.head()
Out[59]:
In [60]:
dfp.plot(x = 'X')
plt.show()
Matplotlib styles are available too:
In [61]:
style_name = 'classic'
plt.style.use(style_name)
dfp.plot(x = 'X')
plt.title('Log($x^2$) and Sine', fontsize=16)
plt.xlabel('X Label', fontsize=16)
plt.ylabel('Y Label', fontsize=16)
plt.show()
In [62]:
mpl.rcdefaults() # Reset matplotlib rc defaults